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The  project  proposal  discussed  two  problem  areas>  »  * 

— <1)  The  solution  of  large  sparse  systems  of  linear  equations*  /‘••Js 
.  (2)  The  solution  of  sparse  least  squares  problems. 

We  report  significant  progress  in  both  of  these  areas  and  in  a  third  area,  the 
solution  of  the  algebraic  eigenvalue  problem. 


The  progress  in  solving  systems  of  linear  equations  included  an  algorithm  for 
computing  ordering  for  efficiently  factoring  sparse  symmetric,  positive  definite 
systems  in  parallel.  We  also  made  important  progress  in  computing  the  ordering 
itself  in  parallel.  Other  progress  included  a  method  for  handling  singular  blocks 
in  a  one-way  dissection  ordering  and  an  error  analysis  of  Gaussian  elimination  in 
unnormalized  arithmetic. 
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For  linear  least  squares  problems  we  developed  an  efficient  reliable  method 
for  detecting  the  rank  of  a  sparse  matrix  without  column  exchanges.  The 
method  used  a  static  data  structure.  We  also  analyzed  and  compared  methods 
for  computing  sparse  and  dense  QR  factorizations  on  message  passing 
architectures . 

On  the  algebraic  eigenvalue  problem,  we  participated  in  resolving  long  standing 
open  questions  on  relative  perturbation  bounds  on  certain  diagonally  dominant 
eigenvalue  problems. 
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This  proposal  considered  two  major  problems  in  sparse  matrix  computations:  the  solution  of 
sparse  systems  of  linear  equations  and  the  solution  of  unconstrained  and  constrained  least  squares 
problems. 

1.  Sparse  systems  of  linear  equations.  We  have  designed  algorithms  for  computing  order¬ 
ings  for  efficiently  factoring  sparse,  symmetric  positive  definite  matrices  in  parallel.  Jess  and  Kees 
(1982)  had  suggested  a  two-step  approach  for  computing  a  good  parallel  ordering:  (i).  compute  a 
fill-reducing  ordering  of  the  matrix;  (ii).  reorder  the  filled  chordal  graph  of  the  Cholesky  factor  L 
from  the  step  (i)  to  obtain  an  ordering  that  permits  the  maximum  possible  parallelism  at  each  step, 
without  incurring  any  additional  fill.  The  ordering  obtained  from  step  (ii)  leads  to  the  shortest 
elimination  tree  of  the  Cholesky  factor  L  which  preserves  the  fill  in  L 

Previous  algorithms  for  implementing  step  (ii)  had  space  and  time  requirements  that  were 
much  greater  than  the  requirements  of  step  (i);  hence  these  algorithms  were  impractical  for  large 
problems.  In  [l],  in  joint  work  with  Lewis  and  Peyton,  we  designed  an  algorithm  for  the  second 
step  that  was  linear  in  the  number  of  compressed  subscripts  for  L  by  making  use  of  a  new  data 
structure,  the  clique  tree.  This  algorithm  requires  much  less  space  and  time  than  the  initial  fill- 
reducing  ordering.  We  also  provided  some  justification  for  this  two-step  approach  by  showing  that 
the  problem  of  computing  the  parallel  ordering  of  A  that  leads  to  the  shortest  elimination  tree 
irrespective  of  the  fill  is  NP-complete  and  hence  intractable  [3]. 

We  have  considered  an  algebraic  approach  to  computing  good  orderings  in  parallel.  The  pa¬ 
per  [12]  (joint  work  with  Simon)  concerns  the  design  of  a  parallel  algorithm  for  computing  the 
parallel  ordering.  It  presents  a  solution  to  the  problem  of  computing  good  separators  (this  is  one 
step  in  a  parallel  ordering  algorithm),  which  makes  use  of  the  eigenvectors  of  the  Laplacian  ma¬ 
trix  of  a  graph.  This  spectral  approach  computes  smaller  separators  than  the  Automated  Nested 
Dissection  algorithm.  We  have  also  shown  that  lower  bounds  on  separator  sizes  can  be  obtained 
from  the  eigenvalues  of  the  Laplacian  matrix.  A  third  paper  (in  preparation)  [11]  make.;  use  of  the 
above  algebraic  approach  for  a  parallel  algorithm  for  reducing  the  envelope  of  sparse  matrices. 

We  have  also  investigated  the  role  p]ayed  by  the  clique  tree  data  structure  (introduced  to 
compute  a  Jess  and  Kees  ordering)  in  other  sparse  matrix  problems. 

We  have  completed  an  implementation  [14]  of  a  multifrontal  sparse  Cholesky  factorization 
algorithm  for  an  iPSC/2  hypercube,  and  are  involved  in  studying  the  influence  of  good  orderings, 
mappings,  and  clique  tree  structures  on  its  efficiency.  In  other  work  [2],  we  have  been  studying  what 
constitutes  a  ‘supernode’,  a  group  of  columns  of  the  Cholesky  factor  L  that  forms  a  maximal  dense 
submatrix.  The  difficulty  is  that  this  concept  depends  on  the  algorithmic  context.  It  can  be  shown 
that  (according  to  one  definition)  a  set  of  supernodes  can  be  obtained  from  the  clique  tree  by  an 
O(n)  time  algorithm.  (Here  n  is  the  order  of  the  matrix.)  Supernodes  have  been  used  by  Ashcraft 
et  al.  (1987)  to  enable  vectorization  in  the  computation  of  sparse  numerical  factorization  at  speeds 
comparable  to  dense  matrix  factorizations  on  vector  supercomputers.  The  report  [4]  describes  our 
preliminary  study  of  supernodes  in  sparse  factorizations. 

The  clique  tree  is  a  compact  representation  of  the  structure  of  the  Cholesky  factor  L.  Never¬ 
theless,  at  the  expense  of  some  computation,  it  is  possible  to  obtain  more  compact  representations. 
In  [13],  we  have  investigated  a  data  structure  called  the  compact  clique  tree.  This  data  structure 
has  applications  in  communication  efficient  parallel  factorization  algorithms  and  in  storage  efficient 
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out-of-core  algorithms.  We  have  proved  that  the  compact  clique  tree  never  requires  storage  greater 
than  the  skeleton  graph  (a  concept  introduced  by  Liu),  and  on  some  worst-case  examples  requires 
only  0(n)  storage  when  the  skeleton  graph  requires  0(n2)  storage.  On  all  of  the  computational 
problems  that  we  have  experimented  with,  the  two  require  almost  identical  storage;  this  is  typically 
about  a  fourth  of  the  storage  needed  for  the  clique  tree. 

2.  Linear  least  squares  problems.  Work  has  been  done  on  several  issues  that  arise  in  the 
solution  of  dense  and  sparse  least  squares  problems  on  both  sequential  and  parallel  computers. 

An  algorithm  for  computing  the  block  upper  triangular  form  of  rectangular  and  square  sparse 
matrices  was  described  and  implemented  in  [6].  This  form  computes  the  irreducible  blocks  of  the 
given  matrix,  and  since  only  these  blocks  need  to  be  factored  to  solve  linear  systems  and  least 
squares  problems,  can  save  the  storage  and  time  needed  to  factor  the  given  matrix.  It  is  also  is 
useful  in  correctly  computing  the  nonzero  structures  of  the  factor  matrices  Q  and  R  during  symbolic 
factorization. 

The  paper  [5]  presents  an  algorithm  for  computing  a  sparse  basis  for  the  null  space  of  the  equi¬ 
librium  matrix  of  a  physical  structure.  This  problem  arises  in  the  solution  of  equality  constrained 
least  squares  problems  in  structural  analysis.  By  making  use  of  the  additional  information  available 
in  the  physical  situation,  sparser  null  bases  could  be  computed;  the  algorithm  was  also  faster  than 
previous  algorithms. 

Several  parallel  algorithms  for  computing  the  orthogonal  factorization  of  a  dense  matrix  on 
a  distributed  memory  multiprocessor  were  described  in  [7,8,9].  Both  Givens  and  Householder 
orthogonalization  algorithms  were  considered.  The  time  and  communication  complexities  of  the 
algorithms  were  analyzed,  and  shown  to  agree  with  the  time  taken  by  the  algorithms. 

The  above  work  laid  the  foundation  for  our  work  on  the  sparse  linear  least  squares  problem. 
The  paper  [10]  describes  a  parallel  algorithm  for  the  numerical  factorization  step  (the  dominant 
step  in  the  time  complexity)  in  the  orthogonal  decomposition  of  a  sparse  matrix  on  a  hypercube 
multiprocessor.  This  algorithm  computes  the  orthogonal  factorization  by  means  of  a  sequence  of 
submatrix  merges  involving  upper  trapezoidal  matrices.  The  merges  are  performed  by  the  use  of 
row  oriented  Householder  transformations,  and  is  organized  around  a  merge  tree  data  structure. 
We  have  shown  that  this  algorithm  has  small  arithmetic  and  communication  complexiti  ?s  and  we 
obtain  good  parallel  efficiencies  in  our  implementation  on  an  iPSC/2  hypercube  multiprocessor. 
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The  project  proposal  discussed  two  problems  in  sparse  matrix  computations:!  1 ) 
the  solution  of  systems  of  linear  equations  ;  (2)  the  solution  of  linear  least  squares 
problems.  We  made  progress  in  both  of  these  areas  and  in  a  third  area:  the  solution 
of  the  algebraic  eigenvalue  problem.  The  work  in  linear  least  squares  resulted  in  a 
new  direction  not  discussed  in  the  proposal.  Our  results  have  implications  in  the 
development  of  stable  parallel  algorithms  to  solve  these  problems. 

1.  The  Solution  of  Linear  Systems  of  Equations.  This  part  of  the  effort 
yielded  two  significant  results.  The  first  concerned  the  technique  of  one-way  dissection 
or  domain  decomposition.  It  is  commonly  used  in  the  numerical  solution  of  partial 
differential  equations.  The  resulting  system  has  a  block  bordered  form  which  is  nearly 
ideal  for  parallel  computation. 

In  some  contexts,  in  particular,  in  the  solution  of  Stokes  flow  problems  [22]  ,  the 
diagonal  blocks  may  be  singular  even  though  the  matrix  is  nonsingular.  We  developed 
an  improvement  of  a  direct  method  for  solving  this  problem  due  to  Gunzberger  and 
Nicholaides  [22].  We  examined  the  stability  of  various  methods  for  resolving  the 
singularity  and  proposed  a  better  back  substitution  algorithm  [8].  We  tested  our 
algorithm  on  the  Intel  iPSC/1  and  it  produced  linear  speedups. 

Special  purpose  devices  have  an  important  role  in  sparse  matrix  computations. 
Digit  online  arithmetic  is  often  used  for  special  purpose  devices  because  its  ability  to 
“pipe  on  digits”  speeds  up  computations  by  factors  ranging  from  2  to  16  [20].  A  PhD 
thesis  from  Penn  State  [29]  pointed  out  that  such  arithmetics  produce  unnormalized 
results  and  thus  numerical  algorithms  may  not  have  the  same  stability  properties. 
Cavellaroet  al.[  15]  recently  used  this  arithmetic  in  the  design  of  a  robot  control  device. 

As  an  example,  we  considered  the  error  analysis  of  Gaussian  elimination  in  un¬ 
normalized  arithmetic  [12].  The  algorithm  exhibits  exactly  the  same  properties  as  in 
standard  arithmetic  for  the  diagonally  dominant  matrices  that  arise  in  the  numerical 
solution  of  partial  differential  equations.  For  general  matrices,  there  are  subtle  dif¬ 
ferences  between  the  error  analysis  for  unnormalized  arithmetic  and  standard  floating 
point  arithmetic. 

2.  The  Solution  of  Linear  Least  Squares  Problems.  We  developed  algo¬ 
rithms  for  two  different  constrained  least  squares  problems.  We  have  also  developed 
an  efficient  method  for  detecting  the  rank  of  a  sparse  matrix.  The  method  allows 
one  to  use  a  static  data  structure  throughout  the  algorithm.  Finally,  we  proved  the 
stability  of  a  parallel  method  for  computing  the  sample  variance. 

For  the  constrained  least  squares  problem 

(1)  min  || /-Ex  ||2 

subject  to  the  constraints 

(2)  Cx  =  g 

where  C  is  an  mi  x  n  matrix,  E  is  an  m?  x  n  matrix,  we  made  progress  on  direct 
approaches.  The  error  analysis  of  a  weighting  procedure  with  deferred  correction 
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was  completed  [5,1].  A  sparse  version  of  the  algorithm  was  developed  in  [10,11,28], 
There  were  two  important  results  from, this.  The  first  is  that  the  deferred  correction 
procedure  converges  in  only  two  iterations  for  a  large  class  of  problems  [11].  This 
development  also  lead  to  a  more  robust  implementation  of  the  strategy.  The  second 
development  concerned  the  the  problem  of  accurately  detecting  the  rank  of  C  [10] 
which  is  crucial  to  the  solution  of  (l)-(2).  We  developed  a  strategy  which  is  about  as 
accurate  as  maximal  column  pivoting  and  is  provably  more  accurate  than  the  strategy 
in  SPARSPAK-B  [21].  The  strategy  allows  one  to  use  a  static  data  structure.  It  can 
be  used  to  detect  the  rank  of  a  general  sparse  matrix  and  does  not  require  access  to  the 
elimination  tree  as  does  a  related  approach  by  Bischof,  Pierce,  and  Lewis  [13].  This 
rank  detection  procedure  can  also  be  used  in  conjunction  with  the  iterative  procedure 
for  (l)-(2)  given  by  Barlow,  Nichols,  and  Plemmons  [6]. 

Joint  work  with  G.  Toraldo  [7]  is  in  progress  on  the  solution  of  the  bound  con¬ 
strained  quadratic  programming  problem 


(3) 


min  -uT Au  —  uTb 
2 


subject  to 


(4)  c  <  x  <  d 

where  c  and  d  are  n-vectors.  We  have  considerd  the  projected  gradient  strategy  for 
this  problem  due  to  More’  and  Toraldo  [26,27].  This  strategy  tends  to  find  the  active 
set  much  faster  than  classical  active  set  strategies.  We  show  that  if  A  can  be  scaled 
to  the  form 


A  =  I  -  N 

where  N  is  symmetric,  positive  definite,  then  the  strategy  will  always  take  large  pro¬ 
jected  gradient  steps.  The  results  can  be  strengthened  considerably  if  A  is  a  Steiltjes 
matrix,  and  the  constraint  (4)  is 

(5)  u  >  0. 

In  that  case,  if  A  is  positive  definite,  we  are  solving  the  linear  complementary  prob¬ 
lem.  Our  simplifications  to  the  projected  gradient  strategy  avoid  evaluations  of 
F(u )  =  \uT Au  -  uTb  entirely.  Our  strategy  solved  a  free  boundary  problem  with 
3337  variables  in  85.62  seconds  on  a  SUN/4.  The  underlying  linear  equation  solver 
was  Jacobi  preconditioned  conjugate  gradient.  However,  we  think  that  the  method 
may  be  suitable  for  use  with  direct  factorization  methods  for  .4.  Preliminary  results 
are  in  the  report  [7], 

An  incidental  result  was  the  error  analysis  of  an  algorithm  due  to  Chan,  Golub, 
and  LeVeque  [16]  for  computing  the  sample  variance.  The  algorithm  computes  the 
variance  of  a  sample  of  size  n  in  0(log  n)  parallel  steps  with  only  one  pass  through 
the  sample  data.  Its  stability  had  been  an  open  question.  Our  analysis  showed  that 
the  algorithm  was  indeed  stable. 

3.  The  Solution  of  the  Algebraic  Eigenvalue  Problem.  The  algebraic  eigen¬ 
value  problem  is  that  of  solving 

(6) 


Ax  =  \Bx 


for  the  scalar  A  and  the  n-vector  x.  In  our  research,  we  considered  the  case  where  A 
and  B  are  symmetric  n  X  n  matrices  and  B  is  positive  definite.  The  problem  (6)  arises 
in  structural  analysis. 

In  [4],  we  considered  the  problem  where  A  is  of  the  form 
(7)  4  =  A(£  +  iV)A. 

Here  A  and  E  are  diagonal  matrices,  A  is  nonsingular,  ||  N  ||2=  7  <  1,  and  the 
diagonals  of  E  are  ±1.  Thus  A  is  diagonally  dominant  in  the  Euclidean  matrix  norm. 
This  class  of  matrices  includes  all  consistently  ordered  diagonally  dominant  matrices 
that  are  solvable  by  the  classical  iterative  methods. 

We  show  that  for  the  case  where  B  =  /  or  A  is  positive  definite,  we  car,  ob¬ 
tain  much  better  perturbation  bounds  on  both  the  eigenvalues  and  eigenvectors  than 
presently  given  in  the  literature.  These  results  resolved  a  well  known  open  question 
regarded  relative  error  bounds  on  eigenvalues. 

These  bounds  led  to  more  robust  algorithms  that  will  be  incorporated  in  the  LA- 
PACK  linear  algebra  library  [4]  for  shared  memory  high  performance  architectures. 
Some  of  the  algorithmic  questions  arising  from  this  analysis  have  been  answered.  So 
far,  only  bisection  followed  by  inverse  iteration  has  been  shown  to  satisfy  all  of  the 
bounds  discussed  in  [4].  For  the  singular  value  decomposition,  the  Jacobi  algorithm 
[18]  achieves  these  bounds  and  the  QR  algorithm  achieves  them  for  the  bidiagonal 
SVD.  However,  the  QR  for  general  symmetric  tridiagonal  matrices  does  not  achieve 
relative  accuracy.  For  the  new  divide-and-conquer  algorithms  that  have  been  consid¬ 
ered  for  implementation  in  LAPACK,  no  results  on  relative  accuracy  have  been  proven 
or  disproven. 

In  [24,25,23],  we  develop  bounds  on  the  eigenvalues  of  banded  Toeplitz  matrices. 
We  also  develop  a  method  for  finding  the  eigenvalues  of  these  matrices  that  uses  the 
Bunch,  Nielsen,  and  Sorensen  [14]  update  procedure.  For  banded  symmetric  Toeplitz 
matrices,  the  procedure  is  faster  than  the  QR  algorithm  and  just  as  stable. 
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